This script follows the steps necessary to reproduce the main results in Camblor et al., 2022. Mentions to Tables and Figures refer to those of this work.

Libraries

library(readxl)
library(tidyverse)
library(HardyWeinberg)
library(epitools)
library(ggsignif)
library(plotly)

1 Data loading and processing

Our study focused on COVID-19 patients admitted to ICU. The data regarding their characterization, clinical variables and genotypes were collected in an Excel file, which can be loaded in R with:

all <- read_excel("patients_data.xlsx")

For this analysis, we want to extract the main clinical and genetic covariables that are in our interest:

data <- all %>% 
  select(EXITUS, AGE, SEX, HTN, DM, HLP, IL6_gt70, DD_gt2000, NFKB1, NFKBIA, NFKBIZ) %>% 
  mutate_if(is.character, as.factor)

With each of the variables standing for:

  • EXITUS: expresses whether the patient has died or not.
  • AGE: self-explanatory.
  • SEX: self-explanatory (male/female distinction).
  • HTN: determines if the patient has hypertension or not.
  • DM: determines if the patient has Diabetes Mellitus or not.
  • HLP: determines if the patient has hyperlipidemia or not.
  • IL6_gt70: in relation to the patient’s levels of IL-6 (a marker for inflammation and severe COVID-19), indicates whether they are superior to 70 pg/mL (a cutoff associated to higher risk of death in COVID-19) or not.
  • DD_gt2000: in relation to the patient’s levels of D-Dimer (a marker for fibrinolysis), indicates whether they are superior to 2000 ng/mL (a cutoff associated to higher risk of death in COVID-19) or not.
  • NFKB1: genotype of the candidate variant rs28362491 (indel) for the NFKB1 gene. Common allele: I, insertion. Rare allele: D, deletion.
  • NFKBIA: genotype of the candidate variant rs696 (G>A) for the NFKBIA gene. Common allele: G. Rare allele: A.
  • NFKBIZ: genotype of the candidate variant rs3217713 (indel) for the NFKBIZ gene. Common allele: I, insertion. Rare allele: D, deletion.

As part of this study, 300 controls were also genotyped for their comparison with the ICU COVID-19 patients. These data were also stored in an Excel file:

control_data <- read_excel("controls_data.xlsx")

Before the analysis, we can declare some variables to act as global definitions:

STUDY_GROUPS <- c("Survivors", "Deceased", "Patients", "Controls")
# Survivors and Deceased refer to ICU patients (EXITUS = NO/YES)
# Patients are the total of the ICU individuals

CLIN_VAR <- colnames(data)[!colnames(data) %in% c("EXITUS", "NFKB1", "NFKBIA", "NFKBIZ")]
CAT_CLIN_VAR <- CLIN_VAR[-which(CLIN_VAR=="AGE")]  # Categorical clinical variables

GENES <- c("NFKB1", "NFKBIA", "NFKBIZ")

SIG_LVL <- 0.05

2 Data description

We can inspect our data at a glance with:

summary(data)
##  EXITUS         AGE            SEX       HTN         DM        HLP     
##  NO :371   Min.   :11.00   FEMALE:129   NO :214   NO  :352   NO  :240  
##  YES: 99   1st Qu.:57.00   MALE  :341   YES:256   YES :104   YES :216  
##            Median :65.00                          NA's: 14   NA's: 14  
##            Mean   :63.41                                               
##            3rd Qu.:73.00                                               
##            Max.   :84.00                                               
##  IL6_gt70   DD_gt2000  NFKB1    NFKBIA   NFKBIZ  
##  NO  :253   NO  :309   DD: 78   AA: 81   DD: 22  
##  YES :164   YES :101   ID:213   GA:215   ID:158  
##  NA's: 53   NA's: 60   II:179   GG:174   II:290  
##                                                  
##                                                  
## 

The total number of COVID-19 ICU patients we are studying is 470. We can highlight a median age of 65 and the predominance of males in the ICU (both age and male sex are risk factors for severe COVID-19). Indeed, the distribution of ages is highly left-skewed:

ggplot(data, aes(x=AGE)) +
  geom_histogram(aes(y = ..density..), color="black", fill="white") +
  geom_density(color="red", alpha=0.1, fill="red") +
  labs(x="Age", y="Density") +
  theme_bw()

Indicating, as expected, that COVID-19 ICU patients tend to be of higher ages.

NA values arise for some of the clinical variables (DM, HLP, IL6_gt70, DD_gt2000), as they were not measured in all the patients. We put our focus on explaining the variable EXITUS —representing the COVID-19 severity— as a function of the rest of the variables.

In Table 1 we show the frequencies in each of the study groups (EXITUS = YES/NO), which can be obtained with the following code. The frequencies can be computed with:

clin_freqs <- list()

for (var in CAT_CLIN_VAR) {
  clin_freqs[["absolute"]][[var]] <- table(data[[var]], data$EXITUS)
  clin_freqs[["relative"]][[var]] <- apply(clin_freqs[["absolute"]][[var]], 2, function(x){x/sum(x)})
}

Accessing individual values in the console with:

clin_freqs$relative$SEX
##         
##                 NO       YES
##   FEMALE 0.2722372 0.2828283
##   MALE   0.7277628 0.7171717

Which leaves us with the following summaries, included in Table 1:

Absolute

risk_groups <- lapply(clin_freqs$absolute, function(x){x[2,]})  # Extract the group of risk
abs_freq_summary <- do.call(rbind.data.frame, risk_groups)  # Join data in a df
colnames(abs_freq_summary) <- STUDY_GROUPS[1:2]
rownames(abs_freq_summary) <- c("SEX (MALE)", CAT_CLIN_VAR[-which(CAT_CLIN_VAR=="SEX")])
abs_freq_summary
Survivors Deceased
SEX (MALE) 270 71
HTN 192 64
DM 83 21
HLP 164 52
IL6_gt70 122 42
DD_gt2000 77 24

Relative

risk_groups <- lapply(clin_freqs$relative, function(x){x[2,]})  # Extract the group of risk
rel_freq_summary <- do.call(rbind.data.frame, risk_groups)  # Join data in a df
colnames(rel_freq_summary) <- STUDY_GROUPS[1:2]
rownames(rel_freq_summary) <- c("SEX (MALE)", CAT_CLIN_VAR[-which(CAT_CLIN_VAR=="SEX")])
rel_freq_summary
Survivors Deceased
SEX (MALE) 0.7277628 0.7171717
HTN 0.5175202 0.6464646
DM 0.2286501 0.2258065
HLP 0.4517906 0.5591398
IL6_gt70 0.3588235 0.5454545
DD_gt2000 0.2436709 0.2553191

Finally, the mean ages ± standard deviation in the study groups (also included in Table 1) are:

data %>% 
  group_by(EXITUS) %>% 
  summarize(Mean=mean(AGE), SD=sd(AGE))
EXITUS Mean SD
NO 61.58760 12.42468
YES 70.26263 10.23159

3 Clinical variable associations

In order to explore the relationship between the clinical variables and EXITUS in our data, we performed a univariate logistic regression for each of them as independent variables, placing EXITUS as the dependent variable.

Each of the logistic regressions can be gathered in a single list with:

exitus_clinical_lrs <- list()

for (var in CLIN_VAR) {
  exitus_clinical_lrs[[var]] <- glm(data$EXITUS ~ data[[var]], family="binomial")
}

Accessing the elements in the console, specially for a summary, with:

summary(exitus_clinical_lrs[["SEX"]])
## 
## Call:
## glm(formula = data$EXITUS ~ data[[var]], family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6996  -0.6833  -0.6833  -0.6833   1.7715  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -1.28292    0.21358  -6.007 1.89e-09 ***
## data[[var]]MALE -0.05283    0.25180  -0.210    0.834    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 483.91  on 469  degrees of freedom
## Residual deviance: 483.87  on 468  degrees of freedom
## AIC: 487.87
## 
## Number of Fisher Scoring iterations: 4

A summary table of the logistic regression results, which was incorporated in Table 1 of the study, was made using the following code:

glm_table <- data.frame("p-value"=numeric(0), "OR"=numeric(0), "Upper_95CI"=numeric(0), "Lower_95CI"=numeric(0))

for (var in CLIN_VAR) {
  fit <- glm(data$EXITUS ~ data[[var]], family="binomial")
  
  p <- summary(fit)$coefficients[2,4]
  or <- exp(coef(fit))[2]
  ub <- exp(confint(fit))[2,1]
  lb <- exp(confint(fit))[2,2]
  
  glm_table[var,] <- c(p, or, ub, lb)
}

glm_table
p.value OR Upper_95CI Lower_95CI
AGE 0.0000000 1.0819502 1.0552660 1.111718
SEX 0.8338289 0.9485450 0.5840536 1.572021
HTN 0.0229231 1.7047619 1.0825549 2.720629
DM 0.9534990 0.9839357 0.5603686 1.672260
HLP 0.0653935 1.5389649 0.9745987 2.443747
IL6_gt70 0.0028068 2.1442623 1.3021251 3.551732
DD_gt2000 0.8180296 1.0641929 0.6175902 1.789455

In the case of the age, a logistic regression answers a different question than what we are interested in: to test the differences in means. The Shapiro-Wilk test is convincing in the lack of normality in the AGE data, but a t-test can still be performed as per the Central Limit Theorem, which considers the approximation of sample means to normality in large samples (in this case, we have groups of N=371 and N=99).

t.test(data$AGE ~ data$EXITUS)  # Welch approximation (variances unequal)
## 
##  Welch Two Sample t-test
## 
## data:  data$AGE by data$EXITUS
## t = -7.1465, df = 182.8, p-value = 2.054e-11
## alternative hypothesis: true difference in means between group NO and group YES is not equal to 0
## 95 percent confidence interval:
##  -11.07006  -6.27999
## sample estimates:
##  mean in group NO mean in group YES 
##          61.58760          70.26263

A very low p-value (<< 0.05) makes certain a difference in the mean ages: deceased COVID-19 patients at the ICU are of an older age.

4 Genetic association study

4.1 Obtaining the genotype and allele frequencies

In order to extract the genotype and allele frequencies —both absolute and relative— from the study data, the variant_freqs function was written.

Show function code
#' Variant frequencies
#' 
#' Compute the genetic (genotype and allele) frequencies for variant genotype observations stored 
#' in a vector
#' 
#' @param variant A chr vector of genotypes
#' @param group A grouping vector for the genotypes (optional)
#' @param margin How the relative frequencies will be computed with grouping variables: 
#' 2, genotype proportions for each group; 1, distribution among groups for each genotype
#' 
#' @return A list with both absolute and relative frequencies for genotypes and alleles 

variant_freqs <- function(variant, group=NULL, margin=2) {
  frequencies <- list()  
  
  if (is.null(group)) {margin=2}  # Force margin 2 if only the genotypes are provided
  
  # Compute the frequencies for the genotypes. To simplify the code, it will be working with 
  # matrices: if the input of the user is only a variant, create a temporary column = c(1, 1, 1)
  if (is.null(group)) {frequencies$genotype_abs <- cbind(table(variant), c(1, 1, 1))}
  else {frequencies$genotype_abs <- table(variant, group)}
  frequencies$genotype_rel <- apply(frequencies$genotype_abs, margin, function(x){x/sum(x)})
  
  # Compute the frequencies for the alleles: 2 * homozygote count + heterozygote count
  first_allele_freq <- frequencies$genotype_abs[3,] * 2 + frequencies$genotype_abs[2,]
  second_allele_freq <- frequencies$genotype_abs[1,] * 2 + frequencies$genotype_abs[2,]
  frequencies$allele_abs <- rbind(first_allele_freq, second_allele_freq)
  rownames(frequencies$allele_abs) <- c(substr(rownames(frequencies$genotype_abs)[3], 1, 1),
                                        substr(rownames(frequencies$genotype_abs)[1], 1, 1))
  frequencies$allele_rel <- apply(frequencies$allele_abs, margin, function(x){x/sum(x)})
  
  # If no group was provided, remove the temporary column
  if (is.null(group)) {frequencies <- lapply(frequencies, function(x){x[,1]})}
  
  return(frequencies)
}

Being able to compute the frequencies among the ICU patients with:

NFKB1 <- variant_freqs(data$NFKB1, data$EXITUS)
NFKBIA <- variant_freqs(data$NFKBIA, data$EXITUS)
NFKBIZ <- variant_freqs(data$NFKBIZ, data$EXITUS)

The total frequencies for the ICU patients are computed with:

NFKB1_ICU <- variant_freqs(data$NFKB1)
NFKBIA_ICU <- variant_freqs(data$NFKBIA)
NFKBIZ_ICU <- variant_freqs(data$NFKBIZ)

And finally, the frequencies for the controls are:

NFKB1_controls <- variant_freqs(control_data$NFKB1)
NFKBIA_controls <- variant_freqs(control_data$NFKBIA)
NFKBIZ_controls <- variant_freqs(control_data$NFKBIZ)

This leaves us with the following frequency summaries, which were used for Table 3:

NFKB1

The absolute frequencies for rs28362491 are:

NFKB1_abs <- as.data.frame(
  rbind(cbind(t(NFKB1$genotype_abs), t(NFKB1$allele_abs)),
        c(NFKB1_ICU$genotype_abs, NFKB1_ICU$allele_abs),
        c(NFKB1_controls$genotype_abs, NFKB1_controls$allele_abs))
)
rownames(NFKB1_abs) <- STUDY_GROUPS; NFKB1_abs
DD ID II I D
Survivors 62 166 143 452 290
Deceased 16 47 36 119 79
Patients 78 213 179 571 369
Controls 46 142 112 366 234

The relative frequencies are:

NFKB1_rel <- as.data.frame(
  rbind(cbind(t(NFKB1$genotype_rel), t(NFKB1$allele_rel)),
      c(NFKB1_ICU$genotype_rel, NFKB1_ICU$allele_rel),
      c(NFKB1_controls$genotype_rel, NFKB1_controls$allele_rel))
)
rownames(NFKB1_rel) <- STUDY_GROUPS; round(NFKB1_rel, 2)
DD ID II I D
Survivors 0.17 0.45 0.39 0.61 0.39
Deceased 0.16 0.47 0.36 0.60 0.40
Patients 0.17 0.45 0.38 0.61 0.39
Controls 0.15 0.47 0.37 0.61 0.39

NFKBIA

The absolute frequencies for rs696 are:

NFKBIA_abs <- as.data.frame(
  rbind(cbind(t(NFKBIA$genotype_abs), t(NFKBIA$allele_abs)),
      c(NFKBIA_ICU$genotype_abs, NFKBIA_ICU$allele_abs),
      c(NFKBIA_controls$genotype_abs, NFKBIA_controls$allele_abs))
)
rownames(NFKBIA_abs) <- STUDY_GROUPS; NFKBIA_abs
AA GA GG G A
Survivors 64 175 132 439 303
Deceased 17 40 42 124 74
Patients 81 215 174 563 377
Controls 61 149 90 329 271

The relative frequencies are:

NFKBIA_rel <- as.data.frame(
  rbind(cbind(t(NFKBIA$genotype_rel), t(NFKBIA$allele_rel)),
      c(NFKBIA_ICU$genotype_rel, NFKBIA_ICU$allele_rel),
      c(NFKBIA_controls$genotype_rel, NFKBIA_controls$allele_rel))
)
rownames(NFKBIA_rel) <- STUDY_GROUPS; round(NFKBIA_rel, 2)
AA GA GG G A
Survivors 0.17 0.47 0.36 0.59 0.41
Deceased 0.17 0.40 0.42 0.63 0.37
Patients 0.17 0.46 0.37 0.60 0.40
Controls 0.20 0.50 0.30 0.55 0.45

NFKBIZ

The absolute frequencies for rs3217713 are:

NFKBIZ_abs <- as.data.frame(
  rbind(cbind(t(NFKBIZ$genotype_abs), t(NFKBIZ$allele_abs)),
      c(NFKBIZ_ICU$genotype_abs, NFKBIZ_ICU$allele_abs),
      c(NFKBIZ_controls$genotype_abs, NFKBIZ_controls$allele_abs))
)
rownames(NFKBIZ_abs) <- STUDY_GROUPS; NFKBIZ_abs
DD ID II I D
Survivors 19 133 219 571 171
Deceased 3 25 71 167 31
Patients 22 158 290 738 202
Controls 12 87 189 465 111

The relative frequencies are:

NFKBIZ_rel <- as.data.frame(
  rbind(cbind(t(NFKBIZ$genotype_rel), t(NFKBIZ$allele_rel)),
      c(NFKBIZ_ICU$genotype_rel, NFKBIZ_ICU$allele_rel),
      c(NFKBIZ_controls$genotype_rel, NFKBIZ_controls$allele_rel))
)
rownames(NFKBIZ_rel) <- STUDY_GROUPS; round(NFKBIZ_rel, 2)
DD ID II I D
Survivors 0.05 0.36 0.59 0.77 0.23
Deceased 0.03 0.25 0.72 0.84 0.16
Patients 0.05 0.34 0.62 0.79 0.21
Controls 0.04 0.30 0.66 0.81 0.19

4.2 Hardy-Weinberg Equilibrium testing

Hardy-Weinberg Equilibrium (HWE) testing is commonly used in genetic association studies as a quality control, as departure from its criteria indicates an underlying unwanted variation that may lead to problems in the interpretation of the results.

The testing is performed for each set of genotype frequencies by means of a Chi-squared test against the expected frequencies that would result of meeting the HWE criteria. We can obtain a simple summary of p-values with:

# The get() function is used to translate the strings in GENES to variable names
for (gene in GENES) {
  print(paste("-----", gene, "----"))
  cat("Survivors:\n  ")
  print(
    HWChisq(get(gene)$genotype_abs[,"NO"], verbose=FALSE)$pval
  )
  cat("Deceased:\n  ")
  print(
    HWChisq(get(gene)$genotype_abs[,"YES"], verbose=FALSE)$pval
  )
  cat("ICU:\n  ")
  print(
    HWChisq(get(paste0(gene, "_ICU"))$genotype_abs, verbose=FALSE)$pval
  )
  cat("Controls:\n  ")
  print(
    HWChisq(get(paste0(gene, "_controls"))$genotype_abs, verbose=FALSE)$pval
  )
  cat("\n")
}
## [1] "----- NFKB1 ----"
## Survivors:
##   [1] 0.2804542
## Deceased:
##   [1] 0.9372743
## ICU:
##   [1] 0.3149684
## Controls:
##   [1] 0.9757332
## 
## [1] "----- NFKBIA ----"
## Survivors:
##   [1] 0.7053844
## Deceased:
##   [1] 0.2306533
## ICU:
##   [1] 0.3349555
## Controls:
##   [1] 0.9599656
## 
## [1] "----- NFKBIZ ----"
## Survivors:
##   [1] 0.9262173
## Deceased:
##   [1] 0.8924809
## ICU:
##   [1] 0.9632842
## Controls:
##   [1] 0.7360303

With all the genotypes meeting the HWE criteria (p-value > 0.05), we can now proceed to the proper genetic association analysis.

4.3 Genotype frequencies association to EXITUS

In this section we’ll perform the association (Chi-squared) tests between the genotype frequencies and the EXITUS variable, outputting some of the results placed in Table 3.

Deceased vs. survivors

We will begin analyzing the deceased vs. survivors data (again, COVID-19 patients in the ICU). A preliminary Chi-squared test can be performed for the genotypes in each of the three polymorphisms with:

chisq_NFKB1 <- chisq.test(NFKB1$genotype_abs); chisq_NFKB1
## 
##  Pearson's Chi-squared test
## 
## data:  NFKB1$genotype_abs
## X-squared = 0.24042, df = 2, p-value = 0.8867
chisq_NFKBIA <- chisq.test(NFKBIA$genotype_abs); chisq_NFKBIA
## 
##  Pearson's Chi-squared test
## 
## data:  NFKBIA$genotype_abs
## X-squared = 1.7712, df = 2, p-value = 0.4125
chisq_NFKBIZ <- chisq.test(NFKBIZ$genotype_abs); chisq_NFKBIZ
## 
##  Pearson's Chi-squared test
## 
## data:  NFKBIZ$genotype_abs
## X-squared = 5.3789, df = 2, p-value = 0.06792

Observing a trend towards significance in the NFKBIZ rs3217713 genotypes. Exploring the deviation in the observed values from the expected values in each case, we can group genotypes based on similar trends to further improve the statistical power in the tests to follow. This way:

chisq_NFKB1$observed-chisq_NFKB1$expected
##        group
## variant         NO        YES
##      DD  0.4297872 -0.4297872
##      ID -2.1340426  2.1340426
##      II  1.7042553 -1.7042553

For the NFKB1 rs28362491 genotypes, heterozygotes are more frequent in deceased patients vs. survivors: ID / II+DD. However, the interpretation of a heterozygote effect is harder to elucidate, so we chose to group ID heterozygotes to the homozygote group with the nearest value of change; i.e: ID+DD / II.

chisq_NFKBIA$observed-chisq_NFKBIA$expected
##        group
## variant          NO         YES
##      AA  0.06170213 -0.06170213
##      GA  5.28723404 -5.28723404
##      GG -5.34893617  5.34893617

For the NFKBIA rs696 genotypes, GG homozygosis is more frequent in deceased patients vs. survivors: GG / GA+AA.

chisq_NFKBIZ$observed-chisq_NFKBIZ$expected
##        group
## variant        NO       YES
##      DD  1.634043 -1.634043
##      ID  8.280851 -8.280851
##      II -9.914894  9.914894

Finally, for the NFKBIZ rs3217713 genotypes, the II homozygosis is more frequent in deceased patients vs. survivors: II / ID+DD.

As we will be working with them later, these groups can be reflected in the data with:

data <- data %>% 
  mutate(NFKB1_GROUP = ifelse(NFKB1=="II", "II", "ID+DD")) %>% 
  mutate(NFKBIA_GROUP = ifelse(NFKBIA=="GG", "GG", "GA+AA")) %>% 
  mutate(NFKBIZ_GROUP = ifelse(NFKBIZ=="II", "II", "ID+DD"))

Finally, we will be executing the new Chi-squared tests. The default Yates’ correction for continuity will not be considered, as no expected values < 5 were found (and as it is regarded by some authors as too conservative). The final genotype association tests were:

chisq.test(data$EXITUS, data$NFKB1_GROUP, correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  data$EXITUS and data$NFKB1_GROUP
## X-squared = 0.15762, df = 1, p-value = 0.6914
chisq.test(data$EXITUS, data$NFKBIA_GROUP, correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  data$EXITUS and data$NFKBIA_GROUP
## X-squared = 1.5703, df = 1, p-value = 0.2102
chisq.test(data$EXITUS, data$NFKBIZ_GROUP, correct=FALSE)  # Significant
## 
##  Pearson's Chi-squared test
## 
## data:  data$EXITUS and data$NFKBIZ_GROUP
## X-squared = 5.3234, df = 1, p-value = 0.02104

Where we then found a significant association of the NFKBIZ rs3217713 II genotype to mortality (frequency raised in deceased patients, as previously stated). The odds ratios can be computed with:

oddsratio.wald(t(table(data$EXITUS, data$NFKB1_GROUP)), rev="r")$measure
##        odds ratio with 95% C.I.
##         estimate     lower    upper
##   II    1.000000        NA       NA
##   ID+DD 1.097588 0.6930439 1.738272
oddsratio.wald(t(table(data$EXITUS, data$NFKBIA_GROUP)))$measure
##        odds ratio with 95% C.I.
##         estimate     lower    upper
##   GA+AA 1.000000        NA       NA
##   GG    1.334131 0.8492349 2.095892
oddsratio.wald(t(table(data$EXITUS, data$NFKBIZ_GROUP)))$measure
##        odds ratio with 95% C.I.
##         estimate    lower    upper
##   ID+DD 1.000000       NA       NA
##   II    1.759948 1.084839 2.855186

Patients vs. controls

We can now perform the analyses to compare the genotype frequencies in the total of the ICU patients (deceased + survivors) against our controls.

First, we establish the comparison matrices with the absolute frequencies previously gathered:

NFKB1_genotype_matrix <- cbind(NFKB1_ICU$genotype_abs, NFKB1_controls$genotype_abs)
colnames(NFKB1_genotype_matrix) <- STUDY_GROUPS[3:4]
NFKBIA_genotype_matrix <- cbind(NFKBIA_ICU$genotype_abs, NFKBIA_controls$genotype_abs)
colnames(NFKBIA_genotype_matrix) <- STUDY_GROUPS[3:4]
NFKBIZ_genotype_matrix <- cbind(NFKBIZ_ICU$genotype_abs, NFKBIZ_controls$genotype_abs)
colnames(NFKBIZ_genotype_matrix) <- STUDY_GROUPS[3:4]

We perform an initial set of Chi-squared tests:

chisq_NFKB1_pvc <- chisq.test(NFKB1_genotype_matrix); chisq_NFKB1_pvc
## 
##  Pearson's Chi-squared test
## 
## data:  NFKB1_genotype_matrix
## X-squared = 0.36974, df = 2, p-value = 0.8312
chisq_NFKBIA_pvc <- chisq.test(NFKBIA_genotype_matrix); chisq_NFKBIA_pvc
## 
##  Pearson's Chi-squared test
## 
## data:  NFKBIA_genotype_matrix
## X-squared = 4.1826, df = 2, p-value = 0.1235
chisq_NFKBIZ_pvc <- chisq.test(NFKBIZ_genotype_matrix); chisq_NFKBIZ_pvc
## 
##  Pearson's Chi-squared test
## 
## data:  NFKBIZ_genotype_matrix
## X-squared = 1.1821, df = 2, p-value = 0.5538

We assess which genotypes can be grouped based on equivalent changes among groups:

chisq_NFKB1_pvc$observed-chisq_NFKB1_pvc$expected  # DD / II+ID (to avoid testing heterozygosis vs. homozygosis)
##     Patients  Controls
## DD  2.311688 -2.311688
## ID -3.688312  3.688312
## II  1.376623 -1.376623
chisq_NFKBIA_pvc$observed-chisq_NFKBIA_pvc$expected  # GG / GA+AA
##     Patients   Controls
## AA -5.675325   5.675325
## GA -7.181818   7.181818
## GG 12.857143 -12.857143
chisq_NFKBIZ_pvc$observed-chisq_NFKBIZ_pvc$expected  # ID+DD / II
##      Patients   Controls
## DD  0.9182058 -0.9182058
## ID  6.0870712 -6.0870712
## II -7.0052770  7.0052770

We establish the groups:

grouped_NFKB1_genotype_matrix <- rbind(NFKB1_genotype_matrix["DD",],
                                       NFKB1_genotype_matrix["ID",] + NFKB1_genotype_matrix["II",])
grouped_NFKBIA_genotype_matrix <- rbind(NFKBIA_genotype_matrix["GG",],
                                        NFKBIA_genotype_matrix["GA",] + NFKBIA_genotype_matrix["AA",])
grouped_NFKBZ_genotype_matrix <- rbind(NFKBIZ_genotype_matrix["ID",] + NFKBIZ_genotype_matrix["DD",],
                                       NFKBIZ_genotype_matrix["II",])

The final genotype association tests were:

chisq.test(grouped_NFKB1_genotype_matrix, correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  grouped_NFKB1_genotype_matrix
## X-squared = 0.216, df = 1, p-value = 0.6421
chisq.test(grouped_NFKBIA_genotype_matrix, correct=FALSE)  # Significant
## 
##  Pearson's Chi-squared test
## 
## data:  grouped_NFKBIA_genotype_matrix
## X-squared = 4.0067, df = 1, p-value = 0.04532
chisq.test(grouped_NFKBZ_genotype_matrix, correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  grouped_NFKBZ_genotype_matrix
## X-squared = 1.1815, df = 1, p-value = 0.2771

Where we find that, although not conclusively, the p-value for NFKBIA < 0.05 states a significant deviation in the genotype frequencies of the rs696 polymorphism in ICU patients vs. controls: GG is higher in the patients. Finally, we compute the odds ratios:

oddsratio.wald(grouped_NFKB1_genotype_matrix)$measure
##                         NA
## odds ratio with 95% C.I. estimate     lower   upper
##                     [1,] 1.000000        NA      NA
##                     [2,] 1.098713 0.7386336 1.63433
oddsratio.wald(grouped_NFKBIA_genotype_matrix)$measure
##                         NA
## odds ratio with 95% C.I. estimate    lower    upper
##                     [1,] 1.000000       NA       NA
##                     [2,] 1.371622 1.006124 1.869895
oddsratio.wald(grouped_NFKBZ_genotype_matrix)$measure
##                         NA
## odds ratio with 95% C.I. estimate   lower    upper
##                     [1,] 1.000000      NA       NA
##                     [2,] 1.184953 0.87247 1.609355

4.4 Allele frequencies association to EXITUS

Here, we will perform the equivalent calculations for the allele frequencies, completing the results of Table 3.

Deceased vs. survivors

We will be working with the two groups of ICU patients. The Chi-squared tests are directly performed with:

chisq.test(NFKB1$allele_abs, correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  NFKB1$allele_abs
## X-squared = 0.043582, df = 1, p-value = 0.8346
chisq.test(NFKBIA$allele_abs, correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  NFKBIA$allele_abs
## X-squared = 0.77976, df = 1, p-value = 0.3772
chisq.test(NFKBIZ$allele_abs, correct=FALSE)  # Significant
## 
##  Pearson's Chi-squared test
## 
## data:  NFKBIZ$allele_abs
## X-squared = 5.0581, df = 1, p-value = 0.02451

And then, the odds ratios are computed with:

oddsratio.wald(NFKB1$allele_abs)$measure
##                         NA
## odds ratio with 95% C.I. estimate     lower    upper
##                        I 1.000000        NA       NA
##                        D 1.034715 0.7510609 1.425496
oddsratio.wald(NFKBIA$allele_abs, rev="r")$measure
##                         NA
## odds ratio with 95% C.I. estimate     lower    upper
##                        A  1.00000        NA       NA
##                        G  1.15656 0.8373311 1.597493
oddsratio.wald(NFKBIZ$allele_abs, rev="r")$measure
##                         NA
## odds ratio with 95% C.I. estimate    lower   upper
##                        D 1.000000       NA      NA
##                        I 1.613299 1.060375 2.45454

Patients vs. controls

Now, we are comparing the total of ICU patients against the controls. We establish the matrices, joining the patients and controls data:

NFKB1_allele_matrix <- rbind(NFKB1_ICU$allele_abs, NFKB1_controls$allele_abs)
NFKBIA_allele_matrix <- rbind(NFKBIA_ICU$allele_abs, NFKBIA_controls$allele_abs)
NFKBIZ_allele_matrix <- rbind(NFKBIZ_ICU$allele_abs, NFKBIZ_controls$allele_abs)

The Chi-squared tests are performed with:

chisq.test(NFKB1_allele_matrix, correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  NFKB1_allele_matrix
## X-squared = 0.010021, df = 1, p-value = 0.9203
chisq.test(NFKBIA_allele_matrix, correct=FALSE)  # Significant
## 
##  Pearson's Chi-squared test
## 
## data:  NFKBIA_allele_matrix
## X-squared = 3.8478, df = 1, p-value = 0.04981
chisq.test(NFKBIZ_allele_matrix, correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  NFKBIZ_allele_matrix
## X-squared = 1.0729, df = 1, p-value = 0.3003

And then, the odds ratios are computed with:

oddsratio.wald(NFKB1_allele_matrix, rev="r")$measure
##                         NA
## odds ratio with 95% C.I. estimate     lower    upper
##                     [1,] 1.000000        NA       NA
##                     [2,] 1.010777 0.8194171 1.246826
oddsratio.wald(NFKBIA_allele_matrix)$measure
##                         NA
## odds ratio with 95% C.I. estimate    lower    upper
##                     [1,]   1.0000       NA       NA
##                     [2,]   1.2301 1.000059 1.513057
oddsratio.wald(NFKBIZ_allele_matrix, rev="r")$measure
##                         NA
## odds ratio with 95% C.I. estimate     lower    upper
##                     [1,] 1.000000        NA       NA
##                     [2,] 1.146634 0.8849813 1.485648

4.5 Association of genotypes to clinical variables

To end the association study, we would be interested to test whether the genotypes of the three variants are associated to the clinical variables or not. The frequencies for each genotype were determined using the variant_freqs function, specifying the clinical variables as grouping variables instead of EXITUS; e.g.:

variant_freqs(data$NFKB1, data$SEX, 1)
## $genotype_abs
##        group
## variant FEMALE MALE
##      DD     20   58
##      ID     64  149
##      II     45  134
## 
## $genotype_rel
##         variant
## group           DD        ID        II
##   FEMALE 0.2564103 0.3004695 0.2513966
##   MALE   0.7435897 0.6995305 0.7486034
## 
## $allele_abs
##   FEMALE MALE
## I    154  417
## D    104  265
## 
## $allele_rel
##                I         D
## FEMALE 0.2697023 0.2818428
## MALE   0.7302977 0.7181572
# We use margin=1 to compare relative frequencies inside grouping variables

To explore the relationships between the variants and the clinical variables, we worked with the genotype groups previously established. A list of all the tests of interest was created with:

genotype_groups <- c("NFKB1_GROUP", "NFKBIA_GROUP", "NFKBIZ_GROUP")

genotypes_clin_associations <- list()

for (genotypes in genotype_groups) {
  for (var in CAT_CLIN_VAR) {
    genotypes_clin_associations[[genotypes]][[var]] <- chisq.test(data[[genotypes]], data[[var]], correct=FALSE)
  }
}

Accessing individual tests in the console with:

genotypes_clin_associations$NFKB1_GROUP$SEX
## 
##  Pearson's Chi-squared test
## 
## data:  data[[genotypes]] and data[[var]]
## X-squared = 0.77279, df = 1, p-value = 0.3794

Differences in age among groups were tested with:

for (group in genotype_groups) {
  print(t.test(data$AGE ~ data[[group]]))
}

These results were collected for NFKB1, NFKBIA and NFKBIZ separately in the Supplementary Tables 1A, 1B and 1C, respectively.

The main finding was a significant association of the NFKBIZ rs3217713 polymorphism to values of D-Dimer above 2000 ng/mL, as seen in:

genotypes_clin_associations$NFKBIZ_GROUP$DD_gt2000
## 
##  Pearson's Chi-squared test
## 
## data:  data[[genotypes]] and data[[var]]
## X-squared = 7.0801, df = 1, p-value = 0.007794

We can determine the nature of this association examining the residuals:

genotypes_clin_associations$NFKBIZ_GROUP$DD_gt2000$residuals
##                  data[[var]]
## data[[genotypes]]         NO        YES
##             ID+DD  1.0496576 -1.8359726
##             II    -0.8014681  1.4018605

Finding that the II genotype is found more frequently than it could be expected in patients with D-Dimer above 2000 ng/mL.

The odds ratio was computed with:

fit <- glm(data$DD_gt2000 ~ data$NFKBIZ_GROUP, family="binomial")
exp(cbind(coef(fit), confint(fit)))
##                                 2.5 %   97.5 %
## (Intercept)         0.208000 0.133444 0.311754
## data$NFKBIZ_GROUPII 1.959657 1.200928 3.279022

4.6 Creating Figure 1

The information we are going to represent is the proportion of patients with each NFKBIZ rs3217713 genotype that have D-Dimer levels above 2000 ng/mL, being the association of the II polymorphism to high D-Dimer levels a major finding in our study. The frequencies we are representing are:

freqs <- variant_freqs(data$NFKBIZ, data$DD_gt2000, 1)$genotype_rel
freqs <- t(freqs); freqs
##        group
## variant        NO       YES
##      DD 0.8571429 0.1428571
##      ID 0.8248175 0.1751825
##      II 0.7104247 0.2895753

Specifically, we are interested in those of the “YES” group (D-Dimer levels above 2000 ng/mL).

With them, we can now elaborate the plot that corresponds to Figure 1:

freqs <- as.data.frame(freqs) %>% 
  rownames_to_column("GENOTYPE") %>% 
  pivot_longer(cols=-GENOTYPE, names_to="DD_gt2000", values_to="FREQUENCY") %>%
  mutate_if(is.character, as.factor)
figure_1 <- freqs %>% 
  filter(DD_gt2000=="YES") %>% 
  ggplot(aes(x=fct_rev(GENOTYPE), y=FREQUENCY, fill=GENOTYPE)) +
  geom_bar(stat="identity", show.legend=FALSE, color="black", size=1.5) +
  geom_text(aes(label=round(FREQUENCY, 2)), vjust=-0.80, size=7.5) +
  ylim(0, 0.45) +
  scale_fill_manual(values=c("#f8e5f8", "#f8e5f8", "#e699e0")) +
  labs(x=element_blank(), y="D-dimer > 2000 ng/mL proportion") +
  geom_signif(comparisons = list(c("II","DD")), y_position = 0.35,
              tip_length = 0, vjust = -1, size=1, annotations = "II vs. ID+DD: p = 0.0078", textsize = 7) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5, margin=margin(b=20), face="bold", size=20),
    axis.title.y = element_text(face="bold", size=20, margin=margin(r=20)),
    axis.text.x = element_text(face="bold", size=17),
    axis.text.y = element_text(face="bold", size=17),
    strip.text.x = element_text(face="bold")) +
  theme(plot.margin = unit(c(0.3,0.2,0.2,0.2), "cm"))

figure_1

5 Expression study

To explore a functional role of the NFKBIZ rs3217713 variant, we performed an expression study of the NFKBIZ transcript in the blood of some of the patients in our cohort (N = 52). Expression was determined in Ct values —the number of qPCR cycles needed to cross a threshold of significant fluorescent signal—, which are inversely proportional to the amount of transcript of the gene. The Ct value for each patient was measured in triplicate, computing the correspondent mean for each of them. The expression of the ACTB gene in Ct values was also determined; as it is a housekeeping gene with constant expression, it allows us to normalize the Ct values of the NFKBIZ gene to account for experimental variability. This can be done by dividing the Ct value of NFKBIZ by the Ct value of ACTB (computing a Ct ratio), for each patient. Again, this ratio is inversely proportional to the expression. These calculations were performed off the script and are our starting point for this last section.

We load the data (which are gathered in Supplementary Table 2) with:

NFKBIZ_expression <- read_excel("NFKBIZ_expression.xlsx")

We want to study the differences in expression as a function of the groups previously determined in the genetic association study:

NFKBIZ_expression <- NFKBIZ_expression %>% 
  mutate(GENOTYPE = ifelse(GENOTYPE=="II", "II", "ID+DD")) %>% 
  mutate_if(is.character, as.factor)

We can visually represent the data with a box plot:

p <- NFKBIZ_expression %>% 
  ggplot(aes(x=fct_rev(GENOTYPE), y=CT_RATIO, fill=GENOTYPE)) +
  geom_boxplot() +
  theme_bw() +
  ylab("NFKBIZ/ACTB Ct ratio") +
  scale_fill_manual(values=c("#f8e5f8", "#e699e0")) +
  theme(
    legend.position="none",
    axis.title.y = element_text(face="bold", size=16, margin=margin(r=20)),
    axis.title.x = element_blank(),
    axis.text.x = element_text(face="bold", size=14),
    axis.text.y = element_text(face="bold", size=14),
    strip.text.x = element_text(face="bold"))

ggplotly(p) %>% config(displayModeBar = FALSE)

A necessary step before performing the t-test is checking for the normality of the expression data (independence, random sampling conditions are met), which was done executing a Shapiro-Wilk test:

shapiro.test(NFKBIZ_expression$CT_RATIO)
## 
##  Shapiro-Wilk normality test
## 
## data:  NFKBIZ_expression$CT_RATIO
## W = 0.97354, p-value = 0.2969

Accepting the normal distribution (null hypothesis; p > 0.05). The homogeneity of variances was tested with an F-test:

var.test(NFKBIZ_expression$CT_RATIO ~ NFKBIZ_expression$GENOTYPE)
## 
##  F test to compare two variances
## 
## data:  NFKBIZ_expression$CT_RATIO by NFKBIZ_expression$GENOTYPE
## F = 2.1737, num df = 20, denom df = 30, p-value = 0.0529
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9902134 5.1051034
## sample estimates:
## ratio of variances 
##           2.173677

Which would lead to accepting the null hypothesis of equal variances, but with a rather inconclusive p-value.

The t-test with equal variances was:

t.test(NFKBIZ_expression$CT_RATIO ~ NFKBIZ_expression$GENOTYPE, var.equal=TRUE)  # Significant
## 
##  Two Sample t-test
## 
## data:  NFKBIZ_expression$CT_RATIO by NFKBIZ_expression$GENOTYPE
## t = -2.0715, df = 50, p-value = 0.04349
## alternative hypothesis: true difference in means between group ID+DD and group II is not equal to 0
## 95 percent confidence interval:
##  -0.099653442 -0.001538263
## sample estimates:
## mean in group ID+DD    mean in group II 
##            1.267043            1.317639

And the t-test with no assumption of variance equality was:

t.test(NFKBIZ_expression$CT_RATIO ~ NFKBIZ_expression$GENOTYPE) # Defaults to Welch's approximation
## 
##  Welch Two Sample t-test
## 
## data:  NFKBIZ_expression$CT_RATIO by NFKBIZ_expression$GENOTYPE
## t = -1.9261, df = 32.316, p-value = 0.06292
## alternative hypothesis: true difference in means between group ID+DD and group II is not equal to 0
## 95 percent confidence interval:
##  -0.104081416  0.002889711
## sample estimates:
## mean in group ID+DD    mean in group II 
##            1.267043            1.317639

Both results do not give an unanimous statement, being around the significance level of 0.05. We chose to declare non-significance in favor of being conservative, but a statistical trend is seen nevertheless. This trend refers to a higher mean of the Ct NFKBIZ/ACTB ratio in patients with the NFKBIZ rs3217713 II genotype, thus suggesting a lower expression of NFKBIZ in this group and indicating a possible functional role for the variant.

Session info

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] plotly_4.10.0       ggsignif_0.6.3      epitools_0.5-10.1  
##  [4] HardyWeinberg_1.7.5 nnet_7.3-17         Rsolnp_1.16        
##  [7] mice_3.14.0         forcats_0.5.1       stringr_1.4.0      
## [10] dplyr_1.0.9         purrr_0.3.4         readr_2.1.2        
## [13] tidyr_1.2.0         tibble_3.1.7        ggplot2_3.3.6      
## [16] tidyverse_1.3.1     readxl_1.4.0       
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.3        sass_0.4.1        jsonlite_1.8.0    viridisLite_0.4.0
##  [5] modelr_0.1.8      bslib_0.3.1       assertthat_0.2.1  highr_0.9        
##  [9] cellranger_1.1.0  yaml_2.3.5        pillar_1.7.0      backports_1.4.1  
## [13] lattice_0.20-45   glue_1.6.2        digest_0.6.29     rvest_1.0.2      
## [17] colorspace_2.0-3  htmltools_0.5.2   pkgconfig_2.0.3   broom_0.8.0      
## [21] haven_2.5.0       scales_1.2.0      tzdb_0.3.0        generics_0.1.2   
## [25] farver_2.1.0      ellipsis_0.3.2    withr_2.5.0       lazyeval_0.2.2   
## [29] cli_3.3.0         magrittr_2.0.3    crayon_1.5.1      evaluate_0.15    
## [33] fs_1.5.2          fansi_1.0.3       MASS_7.3-58       xml2_1.3.3       
## [37] truncnorm_1.0-8   tools_4.2.1       data.table_1.14.2 hms_1.1.1        
## [41] lifecycle_1.0.1   munsell_0.5.0     reprex_2.0.1      compiler_4.2.1   
## [45] jquerylib_0.1.4   rlang_1.0.2       grid_4.2.1        rstudioapi_0.13  
## [49] htmlwidgets_1.5.4 crosstalk_1.2.0   labeling_0.4.2    rmarkdown_2.14   
## [53] gtable_0.3.0      DBI_1.1.2         R6_2.5.1          lubridate_1.8.0  
## [57] knitr_1.39        fastmap_1.1.0     utf8_1.2.2        stringi_1.7.6    
## [61] parallel_4.2.1    Rcpp_1.0.8.3      vctrs_0.4.1       dbplyr_2.1.1     
## [65] tidyselect_1.1.2  xfun_0.31